The purpose of this document is to illustrate time series analysis and forecasting. We will use a simulated dataset to analyze metrics such as visits, discharges, and payments. To perform these analyses we will follow the modeltime workflow, and this report is broken into sections that mirror that workflow.
Let's take a look at our data and see what it contains.
df_tbl %>%
  glimpse()

## Rows: 25,279
## Columns: 12
## $ mrn <chr> "66727914", "84487881", "68427598", "39652414"~
## $ visit_id <chr> "1283065398", "1171004549", "1951203647", "149~
## $ visit_start_date_time <dttm> 2011-12-26 01:14:00, 2011-12-31 05:00:00, 201~
## $ visit_end_date_time <dttm> 2012-01-01 12:06:00, 2012-01-01 12:51:00, 201~
## $ total_charge_amount <dbl> 62580.61, 38466.48, 31758.50, 14699.61, 66096.~
## $ total_adjustment_amount <dbl> -39117.58, -26930.67, -23706.23, -10841.80, -7~
## $ total_payment_amount <dbl> -23463.03, -11535.81, -8052.27, -3857.81, -587~
## $ payer_grouping <chr> "Commercial", "Blue Cross", "Blue Cross", "Blu~
## $ service_line <chr> "Medical", "Surgical", "Medical", "Chest Pain"~
## $ ip_op_flag <chr> "I", "I", "I", "I", "I", "O", "I", "I", "O", "~
## $ adm_date <date> 2011-12-26, 2011-12-31, 2011-12-28, 2011-12-3~
## $ dsch_date <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-0~
skim(df_tbl)

| Name | df_tbl |
|---|---|
| Number of rows | 25279 |
| Number of columns | 12 |
| Column type frequency: | |
| character | 5 |
| Date | 2 |
| numeric | 3 |
| POSIXct | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| mrn | 0 | 1 | 8 | 8 | 0 | 16789 | 0 |
| visit_id | 0 | 1 | 10 | 10 | 0 | 25279 | 0 |
| payer_grouping | 0 | 1 | 10 | 10 | 0 | 2 | 0 |
| service_line | 0 | 1 | 2 | 44 | 0 | 27 | 0 |
| ip_op_flag | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| adm_date | 0 | 1 | 2011-12-19 | 2019-12-31 | 2015-05-27 | 2916 |
| dsch_date | 0 | 1 | 2012-01-01 | 2019-12-31 | 2015-05-29 | 2887 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| total_charge_amount | 0 | 1.00 | 34260.35 | 48285.83 | 0.5 | 10847.83 | 19475.11 | 39463.87 | 1109001.99 | ▇▁▁▁▁ |
| total_adjustment_amount | 0 | 1.00 | -22550.46 | 36053.24 | -914728.0 | -25220.65 | -11619.57 | -6951.65 | 63627.62 | ▁▁▁▁▇ |
| total_payment_amount | 586 | 0.98 | -11584.37 | 18165.44 | -495119.8 | -13132.33 | -5999.98 | -2920.08 | 436.01 | ▁▁▁▁▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| visit_start_date_time | 0 | 1 | 2011-12-19 21:33:00 | 2019-12-31 05:00:00 | 2015-05-27 01:55:00 | 16731 |
| visit_end_date_time | 0 | 1 | 2012-01-01 12:06:00 | 2019-12-31 22:58:00 | 2015-05-29 00:00:00 | 15184 |
Our objective is to produce a grouped forecast of the number of discharges by inpatient/outpatient visit type and by payer grouping, on a weekly scale.
Starting from df_tbl, use summarise_by_time() with .by = "week" and count the visits with n(). Save the result as transactions_weekly_tbl.

transactions_weekly_tbl <- df_tbl %>%
filter(payer_grouping != "?") %>%
group_by(ip_op_flag, payer_grouping) %>%
summarise_by_time(
.date_var = dsch_date
, .by = "week"
, value = n()
)
transactions_weekly_tbl

## # A tibble: 1,667 x 4
## # Groups: ip_op_flag, payer_grouping [4]
## ip_op_flag payer_grouping dsch_date value
## <chr> <chr> <date> <int>
## 1 I Blue Cross 2012-01-01 37
## 2 I Blue Cross 2012-01-08 31
## 3 I Blue Cross 2012-01-15 36
## 4 I Blue Cross 2012-01-22 32
## 5 I Blue Cross 2012-01-29 40
## 6 I Blue Cross 2012-02-05 38
## 7 I Blue Cross 2012-02-12 31
## 8 I Blue Cross 2012-02-19 37
## 9 I Blue Cross 2012-02-26 38
## 10 I Blue Cross 2012-03-04 31
## # ... with 1,657 more rows
Use plot_time_series() to visualize the discharges.
Try a log() transformation to see its effect on the time series.

transactions_weekly_tbl %>%
plot_time_series(
.date_var = dsch_date
, .color_var = payer_grouping
, .facet_vars = payer_grouping
, .facet_ncol = 2
, .value = log(value)
, .smooth = FALSE
  )

Visualize the ACF using plot_acf_diagnostics() with a log() transformation. Look for lags with significant autocorrelation and partial autocorrelation, which are good candidates for lag features.
transactions_weekly_tbl %>%
ungroup() %>%
  plot_acf_diagnostics(dsch_date, log(value))

Next, transform transactions_weekly_tbl by applying log() and then standardize_vec() to the value column. Save the result as transactions_trans_weekly_tbl.

transactions_trans_weekly_tbl <- transactions_weekly_tbl %>%
mutate(value = log(value)) %>%
  mutate(value = standardize_vec(value))

## Standardization Parameters
## mean: 3.08875144281386
## standard deviation: 0.367674566335952
## Standardization Parameters
## mean: 1.83577890003612
## standard deviation: 0.545791389303644
## Standardization Parameters
## mean: 3.15330156564258
## standard deviation: 0.302421031976675
## Standardization Parameters
## mean: 1.59951348649452
## standard deviation: 0.514947645076106
mean_a <- 3.08875144281386
sd_a <- 0.367674566335952
mean_b <- 1.83577890003612
sd_b <- 0.545791389303644
mean_c <- 3.15330156564258
sd_c <- 0.302421031976675
mean_d <- 1.59951348649452
sd_d <- 0.514947645076106

Visualize the log-standardized transactions using plot_time_series(). This confirms the transformation was performed successfully.
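As an aside, these saved means and standard deviations are what will let us invert the transformation later and return forecasts to the original discharge-count scale. A minimal sketch for a single group, assuming timetk's standardize_inv_vec() and a placeholder vector pred of model predictions on the transformed scale:

```r
library(timetk)

# pred: hypothetical predictions on the log-standardized scale
# (placeholder name, not produced in this report yet)
discharges_pred <- exp(                      # undo the log() transform
  standardize_inv_vec(
    x    = pred,
    mean = mean_a,                           # parameters saved above
    sd   = sd_a                              # for this group
  )
)
```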
transactions_trans_weekly_tbl %>%
plot_time_series(
.date_var = dsch_date
, .color_var = payer_grouping
, .facet_vars = payer_grouping
, .facet_ncol = 2
, .value = value
, .smooth = FALSE
  )

We’ll use these parameters to create our “full dataset”. We’ve selected a 14-week forecast horizon. Our lag period is 14 weeks, and we’ll try out a few rolling averages at various aggregations.
horizon <- 14
lag_period <- 14
rolling_periods <- c(7, 14, 28, 52)

Starting from transactions_trans_weekly_tbl:

- Use bind_rows() and future_frame() to extend the data frame, setting .length_out = horizon.
- Use tk_augment_lags() to add a lag column with .lags = lag_period.
- Use tk_augment_slidify() to add rolling features with .period = rolling_periods. Use mean as the rolling function, and make sure to “center” with “partial” windows.
- Save the result as full_tbl.

full_tbl <- transactions_trans_weekly_tbl %>%
# Add future window
bind_rows(
future_frame(
.data = .
, .date_var = dsch_date
, .length_out = horizon
)
) %>%
# Add autocorrelated lags
tk_augment_lags(value, .lags = lag_period) %>%
# Add rolling features
tk_augment_slidify(
.value = value_lag14,
.f = mean,
.period = rolling_periods,
.align = "center",
.partial = TRUE
) %>%
  # Rename columns. Note: contains("lag") also matches ip_op_flag
  # (since "flag" contains the substring "lag"), which is why that
  # column appears as lag_ip_op_flag in the output below.
  rename_with(
    .cols = contains("lag")
    , .fn = ~ str_c("lag_", .)
  ) %>%
select(dsch_date, everything())
full_tbl %>%
  glimpse()

## Rows: 1,723
## Columns: 9
## Groups: lag_ip_op_flag, payer_grouping [4]
## $ dsch_date <date> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-2~
## $ lag_ip_op_flag <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "~
## $ payer_grouping <chr> "Blue Cross", "Blue Cross", "Blue Cross", "Blu~
## $ value <dbl> 1.4201865, 0.9389710, 1.3456669, 1.0253210, 1.~
## $ lag_value_lag14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_7 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_28 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ lag_value_lag14_roll_52 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
Visualize the features, and review what you see.
Starting from full_tbl:

- pivot_longer() every column except “dsch_date” and the grouping columns.
- Use plot_time_series() to visualize the series, coloring by “name”.

Review the visualization, selecting one feature at a time and answering the following questions:
- Do the rolling lags present any issues?
- Which rolling lag captures the trend the best?
full_tbl %>%
pivot_longer(cols = -c(dsch_date, lag_ip_op_flag, payer_grouping)) %>%
plot_time_series(
dsch_date
, value
, name
, .smooth = FALSE
, .facet_ncol = 2
  )

Create a data_prepared_tbl by filtering full_tbl where “value” is non-missing.
data_prepared_tbl <- full_tbl %>%
filter(!is.na(value))
data_prepared_tbl

## # A tibble: 1,667 x 9
## # Groups: lag_ip_op_flag, payer_grouping [4]
## dsch_date lag_ip_op_flag payer_grouping value lag_value_lag14
## <date> <chr> <chr> <dbl> <dbl>
## 1 2012-01-01 I Blue Cross 1.42 NA
## 2 2012-01-08 I Blue Cross 0.939 NA
## 3 2012-01-15 I Blue Cross 1.35 NA
## 4 2012-01-22 I Blue Cross 1.03 NA
## 5 2012-01-29 I Blue Cross 1.63 NA
## 6 2012-02-05 I Blue Cross 1.49 NA
## 7 2012-02-12 I Blue Cross 0.939 NA
## 8 2012-02-19 I Blue Cross 1.42 NA
## 9 2012-02-26 I Blue Cross 1.49 NA
## 10 2012-03-04 I Blue Cross 0.939 NA
## # ... with 1,657 more rows, and 4 more variables: lag_value_lag14_roll_7 <dbl>,
## # lag_value_lag14_roll_14 <dbl>, lag_value_lag14_roll_28 <dbl>,
## # lag_value_lag14_roll_52 <dbl>
Create a forecast_tbl by filtering full_tbl where “value” is missing.
forecast_tbl <- full_tbl %>%
filter(is.na(value))
forecast_tbl

## # A tibble: 56 x 9
## # Groups: lag_ip_op_flag, payer_grouping [4]
## dsch_date lag_ip_op_flag payer_grouping value lag_value_lag14
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-01-05 I Blue Cross NA -1.22
## 2 2020-01-12 I Blue Cross NA -0.393
## 3 2020-01-19 I Blue Cross NA -0.860
## 4 2020-01-26 I Blue Cross NA -0.695
## 5 2020-02-02 I Blue Cross NA -1.88
## 6 2020-02-09 I Blue Cross NA -1.42
## 7 2020-02-16 I Blue Cross NA -1.42
## 8 2020-02-23 I Blue Cross NA -2.14
## 9 2020-03-01 I Blue Cross NA -0.860
## 10 2020-03-08 I Blue Cross NA -1.42
## # ... with 46 more rows, and 4 more variables: lag_value_lag14_roll_7 <dbl>,
## # lag_value_lag14_roll_14 <dbl>, lag_value_lag14_roll_28 <dbl>,
## # lag_value_lag14_roll_52 <dbl>
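As a quick sanity check, the two pieces should account for every row of full_tbl (1,667 + 56 = 1,723):

```r
# The prepared and future rows together should rebuild full_tbl
nrow(data_prepared_tbl) + nrow(forecast_tbl) == nrow(full_tbl)
## [1] TRUE
```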
Split data_prepared_tbl with time_series_split() to create a single time series split:

- Set assess = horizon to take the last 14 weeks of data as testing data.
- Set cumulative = TRUE to use all of the previous data as training data.
- Save the result as splits.

splits <- data_prepared_tbl %>%
  time_series_split(assess = horizon, cumulative = TRUE)

Make a preprocessing recipe using recipe(). Note: it may help to prep() and juice() your recipe to see the effect of your transformations.
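It can also help to confirm the split visually. A quick sketch, assuming timetk's tk_time_series_cv_plan() and plot_time_series_cv_plan() helpers:

```r
# Show which weeks land in training vs. testing
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(dsch_date, value)
```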
- Start with recipe() using “value ~ .” and data = training(splits).
- Add step_timeseries_signature() using the date feature.
- Normalize selected numeric features with step_normalize().
- Dummy-encode categorical variables with step_dummy(), setting one_hot = TRUE.
- Save the result as recipe_spec_base.

recipe_spec_base <- recipe(
value ~ .
, data = training(splits) %>%
arrange(
lag_ip_op_flag
, payer_grouping
, dsch_date)
) %>%
# Time Series Signature
step_timeseries_signature(dsch_date) %>%
step_rm(matches("(iso)|(xts)|(hour)|(minute)|(second)|(am.pm)")) %>%
# Standardization
step_normalize(matches("(index.num)|(year)|(yday)")) %>%
# Dummy Encoding (One Hot Encoding)
step_dummy(all_nominal(), one_hot = TRUE) %>%
  # Fourier features at the 7-, 14-, and 52-week periods seen in the ACF
  step_fourier(dsch_date, period = c(7, 14, 52), K = 2)

recipe_spec_base %>%
prep() %>%
juice() %>%
  glimpse()

## Rows: 1,612
## Columns: 58
## $ dsch_date <date> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01~
## $ lag_value_lag14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_7 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_28 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lag_value_lag14_roll_52 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ value <dbl> 1.4201865, 0.9389710, 1.3456669, 1.0253210, ~
## $ dsch_date_index.num <dbl> -1.725001, -1.716430, -1.707859, -1.699288, ~
## $ dsch_date_year <dbl> -1.505993, -1.505993, -1.505993, -1.505993, ~
## $ dsch_date_half <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ dsch_date_quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,~
## $ dsch_date_month <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,~
## $ dsch_date_day <int> 1, 8, 15, 22, 29, 5, 12, 19, 26, 4, 11, 18, ~
## $ dsch_date_wday <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ dsch_date_mday <int> 1, 8, 15, 22, 29, 5, 12, 19, 26, 4, 11, 18, ~
## $ dsch_date_qday <int> 1, 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78~
## $ dsch_date_yday <dbl> -1.70440843, -1.63725147, -1.57009451, -1.50~
## $ dsch_date_mweek <int> 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1,~
## $ dsch_date_week <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1~
## $ dsch_date_week2 <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,~
## $ dsch_date_week3 <int> 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0,~
## $ dsch_date_week4 <int> 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,~
## $ dsch_date_mday7 <int> 1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,~
## $ lag_ip_op_flag_I <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ lag_ip_op_flag_O <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ payer_grouping_Blue.Cross <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ payer_grouping_Commercial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_01 <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_02 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_03 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,~
## $ dsch_date_month.lbl_04 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,~
## $ dsch_date_month.lbl_05 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_06 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_07 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_08 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_09 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_month.lbl_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ dsch_date_wday.lbl_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_wday.lbl_7 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ dsch_date_sin7_K1 <dbl> 0.37526700, 0.95866785, 0.82017225, 0.064070~
## $ dsch_date_cos7_K1 <dbl> 0.9269168, 0.2845276, -0.5721167, -0.9979454~
## $ dsch_date_sin7_K2 <dbl> 0.6956826, 0.5455349, -0.9384684, -0.1278772~
## $ dsch_date_cos7_K2 <dbl> 0.71834935, -0.83808810, -0.34536505, 0.9917~
## $ dsch_date_sin14_K1 <dbl> -0.1911586, -0.5981105, -0.8865993, -0.99948~
## $ dsch_date_cos14_K1 <dbl> -0.98155916, -0.80141362, -0.46253829, -0.03~
## $ dsch_date_sin14_K2 <dbl> 0.37526700, 0.95866785, 0.82017225, 0.064070~
## $ dsch_date_cos14_K2 <dbl> 0.9269168, 0.2845276, -0.5721167, -0.9979454~
## $ dsch_date_sin52_K1 <dbl> 0.78183148, 0.85128444, 0.90832376, 0.952117~
## $ dsch_date_cos52_K1 <dbl> 0.62348980, 0.52470449, 0.41826780, 0.305731~
## $ dsch_date_sin52_K2 <dbl> 0.9749279, 0.8933455, 0.7598452, 0.5821853, ~
## $ dsch_date_cos52_K2 <dbl> -0.22252093, -0.44937040, -0.65010409, -0.81~
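With the base recipe prepared, the next step in the modeltime workflow is to pair it with one or more model specifications. A minimal, illustrative sketch — the glmnet engine and penalty values here are assumptions for demonstration, not choices made in this report:

```r
library(tidymodels)

# Illustrative elastic-net fit on the base recipe.
# The date column is removed because glmnet requires numeric
# predictors, and rows with NA lag features are dropped.
wflw_fit_glmnet <- workflow() %>%
  add_model(
    linear_reg(penalty = 0.1, mixture = 0.5) %>%
      set_engine("glmnet")
  ) %>%
  add_recipe(recipe_spec_base %>% step_rm(dsch_date)) %>%
  fit(training(splits) %>% drop_na())
```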
Use plot_time_series_regression() to test out several natural splines:

- Try splines::ns() with degrees of freedom 1, 2, 3, and 4.

Which value of df would you select?
data_prepared_tbl %>%
plot_time_series_regression(
.date_var = dsch_date,
.formula = value ~ splines::ns(dsch_date, df = 3),
.show_summary = FALSE,
.facet_ncol = 2
)
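Rather than editing df by hand, the four candidates can be compared in one pass. A sketch assuming purrr is loaded:

```r
library(purrr)

# Plot each candidate spline; .show_summary = TRUE prints the
# regression summary so the fits' R-squared values can be compared.
walk(c(1, 2, 3, 4), function(df_val) {
  data_prepared_tbl %>%
    plot_time_series_regression(
      .date_var     = dsch_date,
      .formula      = value ~ splines::ns(dsch_date, df = df_val),
      .show_summary = TRUE
    ) %>%
    print()
})
```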